extraChIPs: Package Demo
extraChIPs 1.5.9
The talk accompanying this material can be found here
extraChIPs was developed to provide key infrastructure for a larger ChIP-Seq workflow comparing binding dynamics across multiple ChIP targets.
The focus is on:
GRanges objects containing important values in the mcols() elementThe package was developed to utilise existing S4 Object Classes and behave in a tidy-friendly manner, whilst not being strictly designed under tidy programming guidelines. Some functions may be analogous to those in existing packages, but the aim is to be simple and intuitive to use, whilst simplifying high-performance analyses. Previous work by the authors of DiffBind (Ross-Innes et al. 2012) and csaw (Lun and Smyth 2016) has been instrumental in the package design, along with the tidyverse (Wickham et al. 2019) and edgeR (Chen, Lun, and Smyth 2016). The intention of the package is to integrate easily with these packages, wrap multiple functions where appropriate, and provide new tools for both visualisation and working with results.
Today’s workshop will focus on the devel version of the package, which can be installed using
BiocManager::install(
"extraChIPs", version = "devel", update = FALSE, ask = FALSE
)
Alternatively, this can be installed directly from github
BiocManager::install(
"smped/extraChIPs", ref = "devel", update = FALSE, ask = FALSE
)
Additional packages required can be installed as follows:
pkg <- c(
"rtracklayer", "tidyverse", "glue", "plyranges", "Rsamtools", "csaw",
"edgeR","patchwork"
)
BiocManager::install(pkgs = pkg, update = FALSE, ask = FALSE)
library(tidyverse)
library(glue)
library(plyranges)
library(extraChIPs)
library(Rsamtools)
library(csaw)
library(edgeR)
library(rtracklayer)
library(patchwork)
theme_set(theme_bw())
Data can be obtained from here and should be placed in a folder called data within your current working directory.
This data has been pre-processed here using the prepareChIPs workflow (Pederson 2023), and subset to a region on Chromosome 10 for easier working in today’s context.
The two targets in this dataset are the Estrogen Receptor (ER\(\alpha\)) and the histone mark H3K27ac. Data has been taken from ZR-75-1 cells cultured in Estrogen (E2) or Estrogen + Dihydrotestosterone (E2DHT), and was first published by researchers from the Dame Roma Mitchell Cancer Research Laboratories (Hickey et al. 2021)
The sample sheet for the ER\(\alpha\) samples can be setup by copying the following code. As can be seen, we have three samples from the E2 (baseline) group with three more treated by the addition of DHT.
targets <- c("ER", "H3K27ac")
treat_levels <- c("E2", "E2DHT")
samples <- tibble(
accession = paste0("SRR83151", 80:91),
target = rep(targets, each = 6),
treatment = treat_levels %>%
rep(each = 3) %>%
rep_len(12) %>%
factor(levels = treat_levels),
input = "SRR8315192"
)
## Split for easier wrangling
samples <- split(samples, samples$target)
samples
## $ER
## # A tibble: 6 × 4
## accession target treatment input
## <chr> <chr> <fct> <chr>
## 1 SRR8315180 ER E2 SRR8315192
## 2 SRR8315181 ER E2 SRR8315192
## 3 SRR8315182 ER E2 SRR8315192
## 4 SRR8315183 ER E2DHT SRR8315192
## 5 SRR8315184 ER E2DHT SRR8315192
## 6 SRR8315185 ER E2DHT SRR8315192
##
## $H3K27ac
## # A tibble: 6 × 4
## accession target treatment input
## <chr> <chr> <fct> <chr>
## 1 SRR8315186 H3K27ac E2 SRR8315192
## 2 SRR8315187 H3K27ac E2 SRR8315192
## 3 SRR8315188 H3K27ac E2 SRR8315192
## 4 SRR8315189 H3K27ac E2DHT SRR8315192
## 5 SRR8315190 H3K27ac E2DHT SRR8315192
## 6 SRR8315191 H3K27ac E2DHT SRR8315192
We’ll also setup colours for each treatment for consistent visualisations throughout the analysis
treat_colours <- setNames(c("steelblue", "red3"), treat_levels)
Finally, we’ll define a Seqinfo object just for today’s data.
sq <- Seqinfo(
seqnames = "chr10", seqlengths = 135534747, isCircular = FALSE,
genome = "hg19"
)
*MC() FunctionsMany functions exist for combining sets of ranges, such as reduce(), intersect(), setdiff() etc, however these have been extended in extraChIPs to reduceMC(), intersectMC(), setdiffMC() and unionMC() as well as tidy-aligned functions chopMC() and distinctMC().
This group of functions behaves identically to the core GenomicRanges versions, but retains data in the mcols element.
These are used internally by many functions within the package, but also have many stand-alone uses.
Define a simple set of ranges, with ids in the mcols
x <- GRanges(c("chr1:1-10", "chr1:6-12"))
x$id <- c("range1", "range2")
x$gene <- "gene1"
x
## GRanges object with 2 ranges and 2 metadata columns:
## seqnames ranges strand | id gene
## <Rle> <IRanges> <Rle> | <character> <character>
## [1] chr1 1-10 * | range1 gene1
## [2] chr1 6-12 * | range2 gene1
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
When calling reduce() we lose all mcols information.
reduce(x)
## GRanges object with 1 range and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr1 1-12 *
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
These are retained when calling reduceMC
reduceMC(x)
## GRanges object with 1 range and 2 metadata columns:
## seqnames ranges strand | id gene
## <Rle> <IRanges> <Rle> | <CharacterList> <character>
## [1] chr1 1-12 * | range1,range2 gene1
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
All other functions behave in a similar manner
y <- GRanges(c("chr1:11-15"))
intersect(x, y)
## GRanges object with 1 range and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr1 11-12 *
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
intersectMC(x, y)
## GRanges object with 1 range and 2 metadata columns:
## seqnames ranges strand | id gene
## <Rle> <IRanges> <Rle> | <character> <character>
## [1] chr1 11-12 * | range2 gene1
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
setdiff(x, y)
## GRanges object with 1 range and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr1 1-10 *
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
setdiffMC(x, y)
## GRanges object with 1 range and 2 metadata columns:
## seqnames ranges strand | id gene
## <Rle> <IRanges> <Rle> | <CharacterList> <character>
## [1] chr1 1-10 * | range1,range2 gene1
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
union(x, y)
## GRanges object with 1 range and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr1 1-15 *
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
unionMC(x, y)
## GRanges object with 1 range and 2 metadata columns:
## seqnames ranges strand | id gene
## <Rle> <IRanges> <Rle> | <CharacterList> <character>
## [1] chr1 1-15 * | range1,range2 gene1
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
tibble CoercionAs this package is also designed to play well with the tidyverse the functions as_tibble() and colToRanges() have been introduced.
When coercing to a tibble, the GRanges element is coerced to a character vector.
tbl <- as_tibble(x)
tbl
## # A tibble: 2 × 3
## range id gene
## <chr> <chr> <chr>
## 1 chr1:1-10 range1 gene1
## 2 chr1:6-12 range2 gene1
The tibble can be returned to the more conventional set of columns by setting rangeAsChar = FALSE
as_tibble(x, rangeAsChar = FALSE)
## # A tibble: 2 × 7
## seqnames start end width strand id gene
## <fct> <int> <int> <int> <fct> <chr> <chr>
## 1 chr1 1 10 10 * range1 gene1
## 2 chr1 6 12 7 * range2 gene1
Additional methods have been implemented for easy coercion of DataFrame, GInteractions, Seqinfo, SummarzedExperiment and TopTags objects
colToRangesCoercion is implemented from tibble to GRanges using colToRanges()
colToRanges(tbl, var = "range")
## GRanges object with 2 ranges and 2 metadata columns:
## seqnames ranges strand | id gene
## <Rle> <IRanges> <Rle> | <character> <character>
## [1] chr1 1-10 * | range1 gene1
## [2] chr1 6-12 * | range2 gene1
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
colToRanges(tbl, var = "range", seqinfo = seqinfo(x))
## GRanges object with 2 ranges and 2 metadata columns:
## seqnames ranges strand | id gene
## <Rle> <IRanges> <Rle> | <character> <character>
## [1] chr1 1-10 * | range1 gene1
## [2] chr1 6-12 * | range2 gene1
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
Whilst the seqinfo is clearly lost in the reverse coercion, colToRanges() a seqinfo object can be provided.
colToRanges can also move GRanges objects in mcols to the ‘backbone’ ranges.
Here we’ll add a column which contains the peak centre to our original set of ranges, then we’ll move that to become the ‘backbone’ ranges.
x$centre <- GRanges(paste0("chr1:", c(5, 10)), seqinfo = seqinfo(x))
x
## GRanges object with 2 ranges and 3 metadata columns:
## seqnames ranges strand | id gene centre
## <Rle> <IRanges> <Rle> | <character> <character> <GRanges>
## [1] chr1 1-10 * | range1 gene1 chr1:5
## [2] chr1 6-12 * | range2 gene1 chr1:10
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
colToRanges(x, var = "centre")
## GRanges object with 2 ranges and 2 metadata columns:
## seqnames ranges strand | id gene
## <Rle> <IRanges> <Rle> | <character> <character>
## [1] chr1 5 * | range1 gene1
## [2] chr1 10 * | range2 gene1
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
Broad and Narrow Peak files as output by peak callers such as macs2 callpeak (Zhang et al. 2008) can be imported using importPeaks().
This will always return a GRangesList for simple downstream analysis.
If importing narrowPeak files, the peak centre can be automatically added.
fl <- glue("data/ER/{samples$ER$accession}_peaks.narrowPeak")
peaks <- importPeaks(fl, seqinfo = sq, centre = TRUE)
names(peaks) <- str_remove_all(names(peaks), "_peaks.narrowPeak")
(Names can also be set using glue syntax, however the above is generally more easily comprehended at first glance.)
Blacklisted regions and grey-listed regions are also able to be applied during import. Blacklisted regions are usually available from public sources, whilst grey-listed regions are derived from a relevant input.
These have also been provided and can be loaded separately then combined.
bg_list <- file.path(
"data", "annotations", c("chr10_blacklist.bed", "chr10_greylist.bed")
) %>%
lapply(import.bed, seqinfo = sq) %>%
lapply(granges) %>%
GRangesList() %>%
unlist() %>%
sort()
peaks <- importPeaks(fl, seqinfo = sq, centre = TRUE, blacklist = bg_list)
names(peaks) <- str_remove_all(names(peaks), "_peaks.narrowPeak")
Now we’ve loaded our peaks and excluded peaks within unreliable ranges, one of the first tasks might be to check how similar our replicates are.
This can easily be performed using plotOverlaps()
plotOverlaps(peaks, set_col = treat_colours[samples$ER$treatment])
Figure 1: Overlapping peaks between all replicates
plotOverlaps() calls the plotting functions provided by ComplexUpset::upset() (Krassowski 2020) and as such, can handle any additional parameters which are understood by this function.
plotOverlaps(
peaks, .sort_sets = FALSE,
set_col = treat_colours[samples$ER$treatment], n_intersections = 10
)
Figure 2: Overlapping peaks, showing only intersections with 10 or more members and retaining the original sample ordering
The next step might be to form a set of consensus peaks for each treatment. Just forming a set of consensus peaks for the E2 samples can be done manually
makeConsensus(peaks[1:3])
## GRanges object with 244 ranges and 4 metadata columns:
## seqnames ranges strand | SRR8315180 SRR8315181 SRR8315182
## <Rle> <IRanges> <Rle> | <logical> <logical> <logical>
## [1] chr10 43048195-43048529 * | TRUE TRUE TRUE
## [2] chr10 43521739-43522260 * | TRUE TRUE TRUE
## [3] chr10 43540042-43540390 * | TRUE FALSE TRUE
## [4] chr10 43606238-43606573 * | TRUE TRUE TRUE
## [5] chr10 43851214-43851989 * | FALSE TRUE TRUE
## ... ... ... ... . ... ... ...
## [240] chr10 99168353-99168649 * | TRUE TRUE TRUE
## [241] chr10 99207868-99208156 * | FALSE TRUE TRUE
## [242] chr10 99326100-99326363 * | FALSE FALSE TRUE
## [243] chr10 99331363-99331730 * | TRUE TRUE TRUE
## [244] chr10 99621632-99621961 * | FALSE TRUE TRUE
## n
## <numeric>
## [1] 3
## [2] 3
## [3] 2
## [4] 3
## [5] 2
## ... ...
## [240] 3
## [241] 2
## [242] 1
## [243] 3
## [244] 2
## -------
## seqinfo: 1 sequence from hg19 genome
In reality, we may prefer to only retain peaks found in 2/3 samples and we can do this by setting p = 2/3.
We can also also makeConsensus to retain one or more columns from the original peaks using var =
makeConsensus(peaks[1:3], p = 2/3, var = "centre")
## GRanges object with 164 ranges and 5 metadata columns:
## seqnames ranges strand | centre
## <Rle> <IRanges> <Rle> | <NumericList>
## [1] chr10 43048195-43048529 * | 43048341,43048357,43048389
## [2] chr10 43521739-43522260 * | 43522008,43522012,43522039
## [3] chr10 43540042-43540390 * | 43540298,43540245
## [4] chr10 43606238-43606573 * | 43606418,43606408,43606421
## [5] chr10 43851214-43851989 * | 43851672,43851766
## ... ... ... ... . ...
## [160] chr10 99096784-99097428 * | 99097259,99097253,99097249
## [161] chr10 99168353-99168649 * | 99168506,99168486,99168513
## [162] chr10 99207868-99208156 * | 99207995,99208002
## [163] chr10 99331363-99331730 * | 99331621,99331585,99331580
## [164] chr10 99621632-99621961 * | 99621809,99621828
## SRR8315180 SRR8315181 SRR8315182 n
## <logical> <logical> <logical> <numeric>
## [1] TRUE TRUE TRUE 3
## [2] TRUE TRUE TRUE 3
## [3] TRUE FALSE TRUE 2
## [4] TRUE TRUE TRUE 3
## [5] FALSE TRUE TRUE 2
## ... ... ... ... ...
## [160] TRUE TRUE TRUE 3
## [161] TRUE TRUE TRUE 3
## [162] FALSE TRUE TRUE 2
## [163] TRUE TRUE TRUE 3
## [164] FALSE TRUE TRUE 2
## -------
## seqinfo: 1 sequence from hg19 genome
By default, this takes the union of the ranges called as peaks in each sample. We may instead prefer to just retain regions covered in \(\geq\) 2/3 samples. Not that whilst we have the same number of ranges in this particular example, the ranges are tighter.
makeConsensus(peaks[1:3], p = 2/3, method = "coverage")
## GRanges object with 164 ranges and 4 metadata columns:
## seqnames ranges strand | SRR8315180 SRR8315181 SRR8315182
## <Rle> <IRanges> <Rle> | <logical> <logical> <logical>
## [1] chr10 43048224-43048519 * | TRUE TRUE TRUE
## [2] chr10 43521773-43522218 * | TRUE TRUE TRUE
## [3] chr10 43540200-43540384 * | TRUE FALSE TRUE
## [4] chr10 43606264-43606559 * | TRUE TRUE TRUE
## [5] chr10 43851570-43851940 * | FALSE TRUE TRUE
## ... ... ... ... . ... ... ...
## [160] chr10 99097038-99097398 * | TRUE TRUE TRUE
## [161] chr10 99168367-99168608 * | TRUE TRUE TRUE
## [162] chr10 99207908-99208139 * | FALSE TRUE TRUE
## [163] chr10 99331412-99331687 * | TRUE TRUE TRUE
## [164] chr10 99621674-99621944 * | FALSE TRUE TRUE
## n
## <numeric>
## [1] 3
## [2] 3
## [3] 2
## [4] 3
## [5] 2
## ... ...
## [160] 3
## [161] 3
## [162] 2
## [163] 3
## [164] 2
## -------
## seqinfo: 1 sequence from hg19 genome
In order to make consensus peaks for each condition, we can split by treatment group, then use a simple lapply() trick, to work with both treatment groups together.
cons_peaks <- peaks %>%
split(f = samples$ER$treatment) %>%
lapply(makeConsensus, p = 2/3, var = "centre") %>%
lapply(select, centre) %>%
lapply(mutate, centre = vapply(centre, mean, numeric(1))) %>%
GRangesList()
A lot happened in that code chunk, so let’s break it down
split(f = samples$ER$treatment)), thenlapply(makeConsensus, p = 2/3, var = "centre"))mcols to only retain the peak centre (lapply(select, centre)).NumericList, so we took the mean of these for each consensus peakGRangeListNow we can compare our two treatment conditions using plotOverlaps() again, which will return a Venn Diagram when passing a GRangesList of length 2.
plotOverlaps(cons_peaks, set_col = treat_colours)
Figure 3: Overlaps betwen consensus peaks specific to each treatment group
Finally, we can form a set of “union” peaks, defined as those within a consensus peak for either treatment group.
union_peaks <- makeConsensus(cons_peaks, var = "centre") %>%
mutate(centre = vapply(centre, mean, numeric(1)) %>% round(0))
Now we have detected our ranges where ER\(\alpha\) appears to have bound, we might wish to find which genes are likely targets and what type of genomic feature is associated with each range. First let’s define some genomic regions.
A Gencode GTF has been provided in data, restricted to the regions of chr10 we’re working with today.
We can load this then, split based on the feature type, giving a GRangesList with gene, transcript and exon-level features.
gtf <- file.path(
"data", "annotations", "gencode.v43lift37.chr10.annotation.gtf.gz"
) %>%
import.gff() %>%
split(.$type)
seqinfo(gtf) <- sq
The function defineRegions() enables us to define 1) Promoter, 2) Upstream Promoter, 3) Intron, 4) Exon, 5) Proximal Intergenic and 6) Distal Intergenic Regions.
The distances for these regions can be specified by the user, but here we’ll just use the defaults.
regions <- defineRegions(
genes = gtf$gene, transcripts = gtf$transcript, exons = gtf$exon
)
regions
## GRangesList object of length 6:
## $promoter
## GRanges object with 3359 ranges and 5 metadata columns:
## seqnames ranges strand | region
## <Rle> <IRanges> <Rle> | <character>
## [1] chr10 61989-64988 * | Promoter (-2500/+500)
## [2] chr10 66360-69392 * | Promoter (-2500/+500)
## [3] chr10 88152-91151 * | Promoter (-2500/+500)
## [4] chr10 94710-97961 * | Promoter (-2500/+500)
## [5] chr10 119604-122603 * | Promoter (-2500/+500)
## ... ... ... ... . ...
## [3355] chr10 135488075-135491074 * | Promoter (-2500/+500)
## [3356] chr10 135491384-135494383 * | Promoter (-2500/+500)
## [3357] chr10 135494684-135497683 * | Promoter (-2500/+500)
## [3358] chr10 135503135-135506134 * | Promoter (-2500/+500)
## [3359] chr10 135513071-135518323 * | Promoter (-2500/+500)
## gene_id
## <CharacterList>
## [1] ENSG00000260370.2_11
## [2] ENSG00000260370.2_11,ENSG00000260370.2_11,ENSG00000260370.2_11
## [3] ENSG00000237297.1_7
## [4] ENSG00000261456.6_7,ENSG00000261456.6_7,ENSG00000261456.6_7,...
## [5] ENSG00000261456.6_7
## ... ...
## [3355] ENSG00000276046.1_5
## [3356] ENSG00000276904.1_5
## [3357] ENSG00000278641.1_5
## [3358] ENSG00000226880.1
## [3359] ENSG00000213147.3_3,ENSG00000261914.2_3,ENSG00000282875.1_8,...
## gene_name
## <CharacterList>
## [1] ENSG00000260370
## [2] ENSG00000260370,ENSG00000260370,ENSG00000260370
## [3] ENSG00000237297
## [4] TUBB8,TUBB8,TUBB8,...
## [5] TUBB8
## ... ...
## [3355] DUX4L13
## [3356] DUX4L14
## [3357] DUX4L15
## [3358] XX-2136C48.7
## [3359] RPL23AP60,RPL23AP84,ENSG00000282875,...
## transcript_id
## <CharacterList>
## [1] ENST00000562162.2_4
## [2] ENST00000682078.1_1,ENST00000566940.2_4,ENST00000683798.1_1
## [3] ENST00000416477.1_2
## [4] ENST00000568584.6_6,ENST00000568866.5_4,ENST00000561967.1_4,...
## [5] ENST00000564130.2_4
## ... ...
## [3355] ENST00000554103.2_2
## [3356] ENST00000621227.1_2
## [3357] ENST00000611059.1_2
## [3358] ENST00000411632.1
## [3359] ENST00000450011.1_2,ENST00000571193.2_2,ENST00000635574.1_4,...
## transcript_name
## <CharacterList>
## [1] ENST00000562162
## [2] ENST00000682078,ENST00000566940,ENST00000683798
## [3] ENST00000416477
## [4] TUBB8-206,TUBB8-207,TUBB8-201,...
## [5] TUBB8-204
## ... ...
## [3355] DUX4L13-201
## [3356] DUX4L14-201
## [3357] DUX4L15-201
## [3358] XX-2136C48.7-001
## [3359] RPL23AP60-201,RPL23AP84-201,ENST00000635574,...
## -------
## seqinfo: 1 sequence from hg19 genome
##
## ...
## <5 more elements>
Now we can assign these to the peaks using bestOverlap
union_peaks$region <- bestOverlap(union_peaks, regions)
union_peaks
## GRanges object with 188 ranges and 5 metadata columns:
## seqnames ranges strand | centre E2 E2DHT
## <Rle> <IRanges> <Rle> | <numeric> <logical> <logical>
## [1] chr10 43048195-43048529 * | 43048372 TRUE TRUE
## [2] chr10 43521739-43522260 * | 43522014 TRUE TRUE
## [3] chr10 43540042-43540457 * | 43540310 TRUE TRUE
## [4] chr10 43606184-43606573 * | 43606417 TRUE TRUE
## [5] chr10 43851214-43851989 * | 43851719 TRUE FALSE
## ... ... ... ... . ... ... ...
## [184] chr10 99168349-99168690 * | 99168489 TRUE TRUE
## [185] chr10 99207862-99208157 * | 99208004 TRUE TRUE
## [186] chr10 99331323-99331741 * | 99331592 TRUE TRUE
## [187] chr10 99340107-99340354 * | 99340248 FALSE TRUE
## [188] chr10 99621632-99621988 * | 99621798 TRUE TRUE
## n region
## <numeric> <character>
## [1] 2 promoter
## [2] 2 distal_intergenic
## [3] 2 distal_intergenic
## [4] 2 intron
## [5] 1 proximal_intergenic
## ... ... ...
## [184] 2 intron
## [185] 2 promoter
## [186] 2 promoter
## [187] 1 upstream_promoter
## [188] 2 upstream_promoter
## -------
## seqinfo: 1 sequence from hg19 genome
When providing a GRangesList the overlaps are defined by the element names.
Alternatively, we could pass a GRanges object and specify the name from the mcols element we wish to us.
union_peaks$region <- bestOverlap(union_peaks, unlist(regions), var = "region")
union_peaks
## GRanges object with 188 ranges and 5 metadata columns:
## seqnames ranges strand | centre E2 E2DHT
## <Rle> <IRanges> <Rle> | <numeric> <logical> <logical>
## [1] chr10 43048195-43048529 * | 43048372 TRUE TRUE
## [2] chr10 43521739-43522260 * | 43522014 TRUE TRUE
## [3] chr10 43540042-43540457 * | 43540310 TRUE TRUE
## [4] chr10 43606184-43606573 * | 43606417 TRUE TRUE
## [5] chr10 43851214-43851989 * | 43851719 TRUE FALSE
## ... ... ... ... . ... ... ...
## [184] chr10 99168349-99168690 * | 99168489 TRUE TRUE
## [185] chr10 99207862-99208157 * | 99208004 TRUE TRUE
## [186] chr10 99331323-99331741 * | 99331592 TRUE TRUE
## [187] chr10 99340107-99340354 * | 99340248 FALSE TRUE
## [188] chr10 99621632-99621988 * | 99621798 TRUE TRUE
## n region
## <numeric> <character>
## [1] 2 Promoter (-2500/+500)
## [2] 2 Intergenic (>10kb)
## [3] 2 Intergenic (>10kb)
## [4] 2 Intron
## [5] 1 Intergenic (<10kb)
## ... ... ...
## [184] 2 Intron
## [185] 2 Promoter (-2500/+500)
## [186] 2 Promoter (-2500/+500)
## [187] 1 Upstream Promoter (-..
## [188] 2 Upstream Promoter (-..
## -------
## seqinfo: 1 sequence from hg19 genome
Let’s set these as a factor in order as contained in the original set of regions.
reg_levels <- vapply(regions, function(x) x$region[1], character(1))
union_peaks$region <- factor(union_peaks$region, levels = reg_levels)
We can check the distribution of our peaks against these regions using plotPie()
region_colours <- hcl.colors(6, "viridis", rev = TRUE)
names(region_colours) <- reg_levels
plotPie(union_peaks, fill = "region") +
scale_fill_manual(values = region_colours)
Figure 4: Distribution of peaks within defined regulatory regions
The labels for plotPie() use the glue syntax, lending enormous flexibility to what can be placed here.
Whist the defaults are hopefully useful, in the code below. we’ll modify these to wrap the text using str_wrap() and add n = to the peak numbers.
plotPie(
union_peaks, fill = "region", total_size = 5,
cat_glue = "{str_wrap(.data[[fill]], 15)}\nn = {comma(n, 1)}\n({percent(p, 0.1)})"
) +
scale_fill_manual(values = region_colours)
Figure 5: Distribution of peaks with manually tweaked segment labels
Finally, we might wish to know which genes are likely targets for each peak.
mapByFeature() uses a mapping strategy which maps peaks/ranges to genes differently depending on which type of feature they overlap.
union_peaks <- mapByFeature(
union_peaks, genes = gtf$gene, prom = regions$promoter
)
union_peaks %>% filter(str_detect(region, "^Promoter"), !E2)
## GRanges object with 7 ranges and 7 metadata columns:
## seqnames ranges strand | centre E2 E2DHT
## <Rle> <IRanges> <Rle> | <numeric> <logical> <logical>
## [1] chr10 54169026-54169221 * | 54169133 FALSE TRUE
## [2] chr10 63808711-63809004 * | 63808893 FALSE TRUE
## [3] chr10 69847132-69847384 * | 69847254 FALSE TRUE
## [4] chr10 71879941-71880280 * | 71880136 FALSE TRUE
## [5] chr10 74114141-74114500 * | 74114282 FALSE TRUE
## [6] chr10 75579311-75579548 * | 75579425 FALSE TRUE
## [7] chr10 88281479-88281723 * | 88281596 FALSE TRUE
## n region
## <numeric> <factor>
## [1] 1 Promoter (-2500/+500)
## [2] 1 Promoter (-2500/+500)
## [3] 1 Promoter (-2500/+500)
## [4] 1 Promoter (-2500/+500)
## [5] 1 Promoter (-2500/+500)
## [6] 1 Promoter (-2500/+500)
## [7] 1 Promoter (-2500/+500)
## gene_id gene_name
## <CharacterList> <CharacterList>
## [1] ENSG00000227972.1_6 THAP12P3
## [2] ENSG00000150347.17_12 ARID5B
## [3] ENSG00000138347.17_9 MYPN
## [4] ENSG00000042286.15_8 AIFM2
## [5] ENSG00000148719.16_11 DNAJB12
## [6] ENSG00000148660.22_13 CAMK2G
## [7] ENSG00000062650.20_11,ENSG00000227896.2_11 WAPL,WAPL-DT
## -------
## seqinfo: 1 sequence from hg19 genome
Now that we’ve defined our set of peaks present under either condition, we can use these to perform a differential signal analysis.
Existing packages such as DiffBind use fixed-width regions to minimise bias from variable sized regions.
Using extraChIPs we can define these ranges using our peak centres that we’ve carried forward.
centred_peaks <- union_peaks %>%
mutate(
centre = paste0(seqnames, ":", centre),
union_peak = granges(.)
) %>%
colToRanges(var = "centre", seqinfo = sq) %>%
resize(width = 400, fix = "center")
centred_peaks
## GRanges object with 188 ranges and 7 metadata columns:
## seqnames ranges strand | E2 E2DHT n
## <Rle> <IRanges> <Rle> | <logical> <logical> <numeric>
## [1] chr10 43048172-43048571 * | TRUE TRUE 2
## [2] chr10 43521814-43522213 * | TRUE TRUE 2
## [3] chr10 43540110-43540509 * | TRUE TRUE 2
## [4] chr10 43606217-43606616 * | TRUE TRUE 2
## [5] chr10 43851519-43851918 * | TRUE FALSE 1
## ... ... ... ... . ... ... ...
## [184] chr10 99168289-99168688 * | TRUE TRUE 2
## [185] chr10 99207804-99208203 * | TRUE TRUE 2
## [186] chr10 99331392-99331791 * | TRUE TRUE 2
## [187] chr10 99340048-99340447 * | FALSE TRUE 1
## [188] chr10 99621598-99621997 * | TRUE TRUE 2
## region gene_id
## <factor> <CharacterList>
## [1] Promoter (-2500/+500) ENSG00000234420.9_10,ENSG00000270762.1_3
## [2] Intergenic (>10kb) ENSG00000263795.1
## [3] Intergenic (>10kb) ENSG00000165731.21_15
## [4] Intron ENSG00000165731.21_15
## [5] Intergenic (<10kb) ENSG00000285712.1_7
## ... ... ...
## [184] Intron ENSG00000052749.14_14,ENSG00000231970.1_10
## [185] Promoter (-2500/+500) ENSG00000171311.13_12,ENSG00000171307.19_9
## [186] Promoter (-2500/+500) ENSG00000165887.12_14
## [187] Upstream Promoter (-5kb) ENSG00000165887.12_14
## [188] Upstream Promoter (-5kb) ENSG00000155265.11_7
## gene_name union_peak
## <CharacterList> <GRanges>
## [1] ZNF37BP,FXYD6P1 chr10:43048195-43048529
## [2] MIR5100 chr10:43521739-43522260
## [3] RET chr10:43540042-43540457
## [4] RET chr10:43606184-43606573
## [5] ENSG00000285712 chr10:43851214-43851989
## ... ... ...
## [184] RRP12,ENSG00000231970 chr10:99168349-99168690
## [185] EXOSC1,ZDHHC16 chr10:99207862-99208157
## [186] ANKRD2 chr10:99331323-99331741
## [187] ANKRD2 chr10:99340107-99340354
## [188] GOLGA7B chr10:99621632-99621988
## -------
## seqinfo: 1 sequence from hg19 genome
In the above, we’ve 1) created a new range from the centre-points, 2) placed the original range as a new column in the mcols element, 3) moved the centred range to being the primary range using colToRanges().
From there, we’ve resized to a standard width of 400bp.
We can now use these ranges to count alignments across our set of samples.
The bam files should be in the data directory and can be formed into a BamFileList, before being passed to regionCounts() from csaw.
This will return a RangedSummarizedExperiment with totals from the entire bam file in the totals column of the colData element
er_bfl <- file.path("data", "ER", paste0(samples$ER$accession, ".bam")) %>%
BamFileList() %>%
setNames(samples$ER$accession)
er_se <- regionCounts(er_bfl, centred_peaks, ext = 200)
seqinfo(er_se) <- sq
er_se
## class: RangedSummarizedExperiment
## dim: 188 6
## metadata(2): final.ext param
## assays(1): counts
## rownames: NULL
## rowData names(7): E2 E2DHT ... gene_name union_peak
## colnames(6): SRR8315180 SRR8315181 ... SRR8315184 SRR8315185
## colData names(4): bam.files totals ext rlen
colData(er_se)
## DataFrame with 6 rows and 4 columns
## bam.files totals ext rlen
## <character> <integer> <integer> <integer>
## SRR8315180 data/ER/SRR8315180.bam 317845 200 75
## SRR8315181 data/ER/SRR8315181.bam 337623 200 75
## SRR8315182 data/ER/SRR8315182.bam 341998 200 75
## SRR8315183 data/ER/SRR8315183.bam 315872 200 75
## SRR8315184 data/ER/SRR8315184.bam 352908 200 75
## SRR8315185 data/ER/SRR8315185.bam 347709 200 75
Let’s ensure our sample metadata is consistent between this object and samples$ER
colData(er_se) <- colData(er_se) %>%
as_tibble(rownames = "accession") %>%
left_join(samples$ER) %>%
as("DataFrame")
The use of fixed-width and centred ranges minimises bias which might be between samples with different signal characteristics, but importantly, the counts from these centred ranges as simply our proxy for the ranges defined earlier.
Again, using colToRanges() we can place our original set of union peaks back as the GRanges element underlying the RangedSummarizedExperiment.
rowRanges(er_se) <- rowRanges(er_se) %>%
colToRanges(var = "union_peak") %>%
select(region, starts_with("gene"))
Now that we’ve loaded counts, we can explore and visualise them in several ways.
The first might be to check the count distributions, using plotAssayDensities().
plotAssayDensities(er_se, trans = "log1p", colour = "treatment") +
scale_colour_manual(values = treat_colours)
Figure 6: Densities of raw counts for all centred regions, using the ‘log1p’ transformation
We might follow this with a PCA
plotAssayPCA(
er_se, trans = "log1p", colour = "treatment", label = "accession"
) +
scale_colour_manual(values = treat_colours)
Figure 7: PCA plot for all ER samples
An important QC metric as an RLE plot which finds the median value across all samples & ranges to identify any samples with consistently higher/lower or divergent signal. By way of example, we may wish to normalise out counts using logCPM values for this purpose, so let’s perform this step first
assay(er_se, "logCPM") <- cpm(
assay(er_se, "counts"), lib.size = er_se$totals, log = TRUE
)
plotAssayRle(er_se, assay = "logCPM", fill = "treatment") +
geom_hline(yintercept = 0, linetype = 2) +
scale_fill_manual(values = treat_colours)
Figure 8: RLE plot for all ER samples
A problem sometimes faced in ChIP-Seq data which is less prevalent in RNA-Seq data that counts may be drawn from different distributions under different treatments, as DNA-binding is either increased or decreased globally by a treatment. We can simply view all samples summarised by treatment group in order to highlight any possible global differences.
plotAssayRle(er_se, assay = "logCPM", fill = "treatment", by_x = "treatment") +
geom_hline(yintercept = 0, linetype = 2) +
scale_fill_manual(values = treat_colours)
Figure 9: RLE plot for ER samples grouped by treatment
Performing Differential Signal Analysis is now simple using fitAssayDiff(), which offers two methods of analysis: method = "qlf" (default) or method = "lt".
Respectively, these apply Quasi-Likelihood fits using edgeR (Lund et al. 2012), as is appropriate for count data, or limma-trend (Law et al. 2014) if fitting on the log-transformed, (e.g. logCPM) data.
By default, this will return a SummarizedExperiment with results added to the rowRanged()/rowData() element.
Alternatively, we can return a GRanges object by setting asRanges = TRUE
X <- model.matrix(~treatment, data = colData(er_se))
er_results <- fitAssayDiff(er_se, design = X, asRanges = TRUE)
The default normalisation is to use Library Size as contained in the totals column.
However, by setting lib.size = NULL then colSums() will be used if this is considered more suitable.
Additionally, any of the normalisation methods available in edgeR::calcNormFactors() can be applied.
Providing a column name to the groups argument will also normalise within groups, instead of across the entire dataset.
er_results <- fitAssayDiff(er_se, design = X, norm = "TMM", asRanges = TRUE)
Finally, a range-based H0 (McCarthy and Smyth 2009) can be applied by provided a fold-change value below which we consider change to be uninteresting, and we can also automatically add Differential Signal status.
er_results <- fitAssayDiff(
er_se, design = X, norm = "TMM", asRanges = TRUE, fc = 1.2
) %>%
addDiffStatus()
arrange(er_results, PValue)
## GRanges object with 188 ranges and 8 metadata columns:
## seqnames ranges strand | region
## <Rle> <IRanges> <Rle> | <factor>
## [1] chr10 81101906-81102928 * | Upstream Promoter (-5kb)
## [2] chr10 79629641-79630271 * | Promoter (-2500/+500)
## [3] chr10 89407752-89408138 * | Intron
## [4] chr10 52233596-52233998 * | Exon
## [5] chr10 51547205-51547782 * | Promoter (-2500/+500)
## ... ... ... ... . ...
## [184] chr10 93882729-93883168 * | Intron
## [185] chr10 71344398-71344749 * | Intergenic (<10kb)
## [186] chr10 76586249-76586887 * | Promoter (-2500/+500)
## [187] chr10 82725993-82726916 * | Intergenic (>10kb)
## [188] chr10 64605136-64605347 * | Intron
## gene_id
## <CharacterList>
## [1] ENSG00000108179.14_6
## [2] ENSG00000151208.17_11
## [3] ENSG00000225913.2_9,ENSG00000196566.2_10
## [4] ENSG00000198964.14_10,ENSG00000225303.2_10
## [5] ENSG00000263639.7_7
## ... ...
## [184] ENSG00000107864.15_12
## [185] ENSG00000230755.1_6
## [186] ENSG00000272748.1_8,ENSG00000156650.14_17
## [187] ENSG00000227209.1_5
## [188] ENSG00000289487.1_1
## gene_name logFC logCPM PValue
## <CharacterList> <numeric> <numeric> <numeric>
## [1] PPIF 1.887377 8.02941 6.00421e-11
## [2] DLG5 0.966791 8.23125 4.13714e-06
## [3] ENSG00000225913,ENSG00000196566 1.712374 6.20441 1.00877e-05
## [4] SGMS1,ENSG00000225303 1.129786 6.70351 2.34471e-04
## [5] MSMB 0.664784 7.44213 1.13493e-03
## ... ... ... ... ...
## [184] CPEB3 -0.00513719 6.39992 0.985018
## [185] MTATP6P23 -0.00519869 6.77464 0.985285
## [186] ENSG00000272748,KAT6B -0.00909379 8.12097 0.985618
## [187] WARS2P1 -0.00371918 9.43978 0.994243
## [188] ENSG00000289487 0.00109980 5.96820 0.997631
## FDR status
## <numeric> <factor>
## [1] 1.12879e-08 Increased
## [2] 3.88891e-04 Increased
## [3] 6.32163e-04 Increased
## [4] 1.10201e-02 Increased
## [5] 4.26732e-02 Increased
## ... ... ...
## [184] 0.996217 Unchanged
## [185] 0.996217 Unchanged
## [186] 0.996217 Unchanged
## [187] 0.997631 Unchanged
## [188] 0.997631 Unchanged
## -------
## seqinfo: 1 sequence from hg19 genome
Given these results are in a GRanges object, we can produce an MA-plot using conventional ggplot2 approaches.
status_colours <- c(
Undetected = "grey70", Unchanged = "grey30",
Decreased = "blue", Increased = "red"
)
er_results %>%
as_tibble() %>%
ggplot(aes(logCPM, logFC)) +
geom_point(aes(colour = status)) +
geom_smooth(method = "lm", se = FALSE) +
scale_colour_manual(values = status_colours)
Figure 10: MA-plot for ERa analysis using fixed-width windows, TMM normalisation and setting 20% as the range beyond which we consider diferential signal to be of interest
Additionally, we might like to compare Differential Signal Status against the regions we defined earlier and
plotSplitDonut() is well suited to this.
A key advantage of plotSplitDonut() is the ability to specify inner and outer palettes separately
er_results %>%
plotSplitDonut(
inner = "region", outer = "status",
inner_palette = region_colours, outer_palette = status_colours
)
Figure 11: Split-Donut plot showing distribution of peaks with diferential signal
Once again, labels are controlled using the glue syntax and segments are able to be exploded based on matches to a regular expression
er_results %>%
plotSplitDonut(
inner = "region", outer = "status",
inner_palette = region_colours, outer_palette = status_colours,
inner_glue = "n = {n}", inner_min_p = 0, inner_label_size = 3.5,
outer_glue = "{str_wrap(.data[[inner]], 15)}\n{.data[[outer]]}\nn = {n}",
outer_min_p = 0, outer_max_p = 0.03,
explode_outer = "Increased|Decreased", explode_r = 0.3
)
Figure 12: Split-Donut plot exploding segments with evidence of differential signal, and modifying segment labels
fitAssayDiff() is agnostic to sliding or fixed-width windows.
Whilst fixed-width windows are relatively simple conceptually, sliding windows offer other advantages such as being able to manage data where signal comes from genomic regions which are highly variable in width, such as histone marks like H3K27ac.
Sliding window analysis is well-implemented in csaw, however extraChIPs offers the additional ability to combine ranges using asymptotically correct harmonic mean p-value (Wilson 2019).
The step of moving from genomic windows to a subset of windows which we can confidently assume contain signal is also able to be managed by extraChIPs using dualFilter().
For this section of the analysis, we’ll use data for the histone mark H3K27ac from the same cell type under the same two conditions (E2 + E2DHT).
A new sample metadata sheet can be formed using the following, and we’ll immediately for the BamFileList for reading in counts.
acc <- c(samples$H3K27ac$accession, samples$H3K27ac$input) %>%
unique()
h3k_bfl <- file.path("data", "H3K27ac", paste0(acc, ".bam")) %>%
BamFileList() %>%
setNames(acc)
Given we’re going to use sliding windows, we can use a smaller window size and need to set a step-size for ‘sliding’ along the genome. Whilst there are no hard and fast rules, for a histone marks, window sizes of 150bp sliding in steps of 50bp may be a good choice.
win_size <- 150
win_step <- 50
The example dataset for today can be easily managed on a laptop, however, it should be noted that for a full-scale dataset, large amounts of RAM are required, and this process may take considerable time.
In the following we define some key parameters for passing to csaw::windowCounts(), where we restrict counting to chr10, provide out black and grey-lists and let the function know that these are all single-end reads.
From there, we perform the counts using the window step & size as defined above, extending each read to represent a full DNA fragment of 200nt, and discarding any regions with fewer than 1 read/sample.
rp <- readParam(pe = "none", restrict = "chr10", discard = bg_list)
wincounts <- windowCounts(
bam.files = h3k_bfl,
spacing = win_step, width = win_size, ext = 200,
filter = length(h3k_bfl),
param = rp
)
seqlevels(wincounts) <- seqlevels(sq)
seqinfo(wincounts) <- sq
dim(wincounts)
## [1] 467904 7
Many of these windows will have counted reads which are essentially background noise, and we wish to exclude these and only retain those which confidently contain signal from our ChIP target.
extraChIPs provides a function dualFilter() in which a set of guide ranges can be provided where signal is expected.
Thresholds for minimum overall signal, and signal relative to input are set based on retaining a give proportion of reads which overlap these ranges, with lower numbers returning more stringently selected ranges.
A set of guide ranges has been prepared following a similar process to forming union peaks above for ER\(\alpha\), and can be loaded then passed to dualFilter().
Again, this dataset is easily handled on a laptop, whilst for a complete genome, considerable amounts of RAM will be required.
guide_ranges <- file.path("data", "H3K27ac", "H3K27ac_chr10.bed") %>%
import.bed(seqinfo = sq)
filtcounts <- dualFilter(
x = wincounts[, samples$H3K27ac$accession],
bg = wincounts[, unique(samples$H3K27ac$input)],
ref = guide_ranges, q = 0.6
)
colData(filtcounts) <- colData(filtcounts) %>%
as_tibble(rownames = "accession") %>%
left_join(samples$H3K27ac) %>%
as("DataFrame")
dim(filtcounts)
## [1] 19017 6
By default, an additional assay logCPM will be added during this step and this can also be used for QC.
A <- plotAssayDensities(filtcounts, assay = "logCPM", colour = "treatment") +
scale_colour_manual(values = treat_colours)
B <- plotAssayPCA(
filtcounts, assay = "logCPM", colour = "treatment", label = "accession"
) +
scale_colour_manual(values = treat_colours)
A + B + plot_layout(guides = "collect") + plot_annotation(tag_levels = "A")
Figure 13: A) Density plot and B) PCA based on logCPM values from the sliding windows retained after filtering
Now we can pass our filtered windows to fitAssayDiff() as we did previously, however, once we have performed statistical testing on each window, we need to merge windows to obtain our final results.
X <- model.matrix(~treatment, data = colData(filtcounts))
h3k_all <- fitAssayDiff(filtcounts, norm = "TMM", design = X, asRanges = TRUE)
Although we haven’t declared in regions as showing differential signal, we can still inspect an MA plot for any unexpected issues in our data
h3k_all %>%
as_tibble() %>%
ggplot(aes(logCPM, logFC)) +
geom_point() +
geom_smooth(se = FALSE)
Figure 14: MA plot using all sliding windows retained after the initial filtering step
As expected with H3K27ac signal, the total amount of signal obtained under two related conditions is similar and we have a nicely centred plot.
Now we can merge overlapping windows to obtain our final results.
Using extraChIPs this can be performed using all methods offered by csaw, (?mergeByCol and ?mergeBySig) as well as using the asymptotically-exact harmonic mean p-value, which we’ll use here.
h3k_results <- mergeByHMP(h3k_all, min_win = 2, merge_within = win_size + 1) %>%
addDiffStatus()
h3k_results
## GRanges object with 643 ranges and 9 metadata columns:
## seqnames ranges strand | n_windows n_up n_down
## <Rle> <IRanges> <Rle> | <integer> <integer> <integer>
## [1] chr10 42862951-42863350 * | 6 5 0
## [2] chr10 43047801-43048200 * | 6 0 3
## [3] chr10 43048401-43048800 * | 6 0 3
## [4] chr10 43132901-43134800 * | 36 13 16
## [5] chr10 43135701-43137250 * | 29 7 3
## ... ... ... ... . ... ... ...
## [639] chr10 99473501-99474250 * | 13 0 5
## [640] chr10 99496401-99497850 * | 27 9 18
## [641] chr10 99608301-99610050 * | 32 0 8
## [642] chr10 99628751-99629400 * | 11 2 6
## [643] chr10 99894101-99895400 * | 24 4 13
## keyval_range logCPM logFC hmp hmp_fdr
## <GRanges> <numeric> <numeric> <numeric> <numeric>
## [1] chr10:42863051-42863200 6.31662 0.3345653 0.4041174 0.997968
## [2] chr10:43048001-43048150 6.16996 -0.0780895 0.8764236 0.997968
## [3] chr10:43048601-43048750 6.31199 -0.2619553 0.6083939 0.997968
## [4] chr10:43134251-43134400 7.67436 0.1350775 0.5145484 0.997968
## [5] chr10:43137101-43137250 6.87832 0.0791231 0.0119809 0.126290
## ... ... ... ... ... ...
## [639] chr10:99473851-99474000 6.43974 -0.4427309 0.2576771 0.815392
## [640] chr10:99497701-99497850 7.41777 0.0551099 0.9923554 0.997968
## [641] chr10:99608601-99608750 7.10450 -0.5008959 0.0309346 0.236797
## [642] chr10:99628851-99629000 6.63987 -0.2298606 0.6588732 0.997968
## [643] chr10:99895051-99895200 8.08365 -0.1143540 0.6362419 0.997968
## status
## <factor>
## [1] Unchanged
## [2] Unchanged
## [3] Unchanged
## [4] Unchanged
## [5] Unchanged
## ... ...
## [639] Unchanged
## [640] Unchanged
## [641] Unchanged
## [642] Unchanged
## [643] Unchanged
## -------
## seqinfo: 1 sequence from hg19 genome
Notice that we have additional columns not returned when using fixed-width windows.
The first three indicate how many original windows were merged, along with how many may have been individually considered as showing differential signal in either direction.
Next to these is the column keyval_range which indicates the original window which returned the lowest p-value, or by setting keyval = "merged", will return the range containing all original windows with raw p-values below the harmonic-mean p-value.
Given the harmonic mean involves summing the inverse of raw p-values, these are considered to be weights and the returned values for logCPM and logFC are also weighted averages, using \(w_i = \frac{1}{p_i}\) as weights for each windows \(i\). Given this, any MA-plots may appear to show bias, which is in fact due to the weighting strategy used here. As such the previous figure using all windows prior to merging is the best strategy for identifying any issues with the data
h3k_results %>%
as_tibble() %>%
ggplot(aes(logCPM, logFC)) +
geom_point(aes(colour = status), alpha = 0.7) +
geom_smooth(method = "lm", se = FALSE) +
scale_colour_manual(values = status_colours)
Figure 15: MA plot showing results after merging overlapping windows
extraChIPs provide additional capacity for comparing between two or more sets of results.
The key structure for this is a GRangesList, so let’s ensure we have compatible column names, then form a combined object.
h3k_results <- h3k_results %>%
select(logCPM, logFC, PValue = hmp, FDR = hmp_fdr, status) %>%
mutate(region = bestOverlap(., unlist(regions), var = "region")) %>%
mapByFeature(genes = gtf$gene, prom = regions$promoter)
both_results <- GRangesList(ER = er_results, H3K27ac = h3k_results)
Now we have a combined GRangesList object, we can plot columns from each element against the other.
If we wish to compare signal in sites where both are detected, we can use plotPairwise(), which will also show boxplots of signal for where only one target is detected, alongside those where both targets are detected
plotPairwise(both_results, var = "logCPM")
Figure 16: Comparison of logCPM values when both targets are detected, with boxplots showing the range of values when each target was detected alone, or when both were detected together
Although we only have a handful of sites in our example dataset, one might easily suspect that signal for both targets is higher when both are detected together.
Alternatively, we might wish to see which sites are changing together, or where only one is changing.
This time we’ll plot logFC, colouring points by the combined Differential Signal status in each element, and adding labels from the gene_name column.
Labels are only added to the sites in each combined group which are furthest from zero.
plotPairwise(
both_results, var = "logFC", colour = "status", label = "gene_name"
) +
scale_colour_manual(
values = c("lightskyblue", "palevioletred", "grey", "darkred")
) +
scale_fill_manual(
values = c(
setNames(status_colours, paste("ER", names(status_colours))),
setNames(status_colours, paste("H3K27ac", names(status_colours)))
)
)
Figure 17: Comparison of logFC values when both targets are detcted, with boxplots showing the range of values when only one target is detected, or when both are detected
We can also compare columns by performing an operation somewhat analogous to pivot_wider where we find overlapping ranges and add columns from each element to a set of consensus peaks.
mapGrlCols(both_results, var = "status")
## GRanges object with 741 ranges and 2 metadata columns:
## seqnames ranges strand | ER_status H3K27ac_status
## <Rle> <IRanges> <Rle> | <factor> <factor>
## [1] chr10 42862951-42863350 * | NA Unchanged
## [2] chr10 43047801-43048800 * | Unchanged Unchanged
## [3] chr10 43047801-43048800 * | Unchanged Unchanged
## [4] chr10 43132901-43134800 * | NA Unchanged
## [5] chr10 43135701-43137250 * | NA Unchanged
## ... ... ... ... . ... ...
## [737] chr10 99496401-99497850 * | NA Unchanged
## [738] chr10 99608301-99610050 * | NA Unchanged
## [739] chr10 99621632-99621988 * | Unchanged NA
## [740] chr10 99628751-99629400 * | NA Unchanged
## [741] chr10 99894101-99895400 * | NA Unchanged
## -------
## seqinfo: 1 sequence from hg19 genome
We can use this to quickly find sites which are of interest in both sets of results, along with their targets
both_results %>%
mapGrlCols(var = c("region", "status", "FDR"), collapse = "gene_name") %>%
filter(
str_detect(ER_status, "Increased|Decreased"),
str_detect(H3K27ac_status, "Increased|Decreased")
)
## GRanges object with 3 ranges and 7 metadata columns:
## seqnames ranges strand | ER_region
## <Rle> <IRanges> <Rle> | <factor>
## [1] chr10 71879101-71881150 * | Promoter (-2500/+500)
## [2] chr10 79623301-79635950 * | Promoter (-2500/+500)
## [3] chr10 81101651-81103150 * | Upstream Promoter (-5kb)
## H3K27ac_region ER_status H3K27ac_status ER_FDR H3K27ac_FDR
## <factor> <factor> <factor> <numeric> <numeric>
## [1] Promoter (-2500/+500) Increased Increased 4.46675e-02 8.68739e-03
## [2] Promoter (-2500/+500) Increased Increased 3.88891e-04 7.76644e-06
## [3] Upstream Promoter (-5kb) Increased Increased 1.12879e-08 1.60656e-05
## gene_name
## <character>
## [1] AIFM2
## [2] DLG5; ENSG00000204049
## [3] PPIF
## -------
## seqinfo: 1 sequence from hg19 genome
This also makes comparison of sites easy using plotSplitDonut() and in the following we’ll only show sites where both are detected.
both_results %>%
mapGrlCols(var = "status") %>%
as_tibble() %>%
dplyr::filter(!if_any(ends_with("status"), is.na)) %>%
dplyr::rename_with(\(x) str_remove_all(x, "_status")) %>%
plotSplitDonut(
inner = "H3K27ac", outer = "ER",
inner_palette = status_colours, outer_palette = status_colours,
inner_glue = "{inner}\n{.data[[inner]]}\n{n}",
outer_glue = "{outer}\n{.data[[outer]]}\n{n}",
label_alpha = 0.8, min_p = 0.02,
explode_inner = "Increased", explode_outer = "Increased",
explode_query = "AND", explode_r = 0.3
)
Figure 18: Combined status when both targets are detected together
Perhaps we might wish to look at the combined status in reference to the genomic regions we defined previously. Here, we’ll scale the site by width to show results in kb, instead of counting sites.
both_results %>%
mapGrlCols(var = c("region", "status")) %>%
as_tibble(rangeAsChar = FALSE) %>%
dplyr::filter(!if_any(ends_with("status"), is.na)) %>%
mutate(
ER_status = fct_relabel(ER_status, \(x) paste("ER", x)),
H3K27ac_status = fct_relabel(H3K27ac_status, \(x) paste("H3K27ac", x)),
combined = fct_cross(ER_status, H3K27ac_status, sep = "\n")
) %>%
plotSplitDonut(
scale_by = "width",
inner = "H3K27ac_region", outer = "combined",
inner_palette = region_colours,
total_glue = "{comma(N)}kb",
inner_glue = "{str_wrap(.data[[inner]], 15)}\n{comma(n, 1)}kb",
outer_glue = "{.data[[outer]]}\n{comma(n, 0.1)}kb",
outer_min_p = 0.01, #outer_max_p = 0.02,
explode_outer = "(Inc|Dec).+(Inc|Dec)", explode_r = 0.4,
label_alpha = 0.7
)
Figure 19: Combined status shown by the region, as defined by the best overlap for H3K27ac signal
Regions here are scaled by width and totals shown in kb
As well the above strategies for inspecting results, looking directly at the sites using a genome browser is a common strategy.
extraChIPs provides the function plotHFGC() which enables us to take a genome level viewpoint of any 1) HiC Interactions, 2) Features, 3) Gene Models and 4) Coverage.
All tracks are optional and will be displayed in this order when provided using the various track types provided by Gviz (Hahne and Ivanek 2016).
In order to use this approach to scan through our results choosing one ranges at a time, let’s define the elements that we have starting with our genomic regions, which we would consider to be a set of genomic features.
For this track, we’ll need a GRangesList() and each element will be filled a different colour, meaning we’ll need a matching vector of colours.
We already have both of these from earlier on our workflow, but we will need to change the names on our colour vector
names(region_colours) <- names(regions)
Next we’ll setup our gene models in the format that Gviz is familiar with, using the exons element from our GTF annotations.
gene_models <- gtf$exon %>%
select(
type, gene = gene_id, exon = exon_id, transcript = transcript_id,
symbol = gene_name
)
This leaves coverage as our final track, or set of tracks.
plotHFGC() takes BigWigFileList objects for coverage and if passing a single BigwigFileList, each element will be drawn on a separate track.
For our results we have both ER and H3K27ac so a more informative plot might be to overlay the two treatments in a single track for each target.
To facilitate this, we need a named list of BigwigFileList objects and a matching named list of colours.
In each of our data directories we have summarised coverage for E2 and E2DHT
targets <- c("ER", "H3K27ac")
bwfl <- list(
ER = file.path("data", "ER", glue("{treat_levels}_cov_chr10.bw")),
H3K27ac = file.path("data", "H3K27ac", glue("{treat_levels}_cov_chr10.bw"))
) %>%
lapply(setNames, treat_levels) %>%
lapply(BigWigFileList)
line_colours <- lapply(bwfl, function(x) treat_colours[names(x)])
Now we have our track setup, we’ll load the cytogenetic bands provided with extraChIPs to ensure we have a band at the top of the plot, although this too is optional.
data("grch37.cytobands")
And finally, let’s select our first range as the most highly ranked site for change ER binding, which also appears to show an increase in H3K27ac signal
gr <- both_results %>%
mapGrlCols(
var = c("region", "status", "FDR"), collapse = "gene_name"
) %>%
arrange(ER_FDR) %>%
.[1]
Now we pass all of these to plotHFGC() zooming out a little to try and get some kind of wider context.
plotHFGC(
gr, cytobands = grch37.cytobands,
features = regions, featcol = region_colours,
genes = gene_models, genecol = "wheat",
coverage = bwfl, linecol = line_colours,
zoom = 10
)
Figure 20: Default plot from plotHFGC showing both ER and H3K27ac signal on separate tracks, with the two treatment groups overlaid as separately coloured lines
One helpful feature of is the ability to annotate our coverage tracks to indicate statistical significance.
If we simply split our initial set of results into GRangesList objects based on their status, we can use these to add coloured bars above our coverage tracks to denote that signal was detected there, and what the results from analysis were.
We can simply pass our vector of status colours that we’ve already defined for these
cov_annot <- list(
ER = splitAsList(er_results, er_results$status),
H3K27ac = splitAsList(h3k_results, h3k_results$status)
)
Now let’s add these annotations and tweak the appearance of the plot
plotHFGC(
gr, cytobands = grch37.cytobands,
features = regions, featcol = region_colours, featstack = "dense",
genes = gene_models, genecol = "wheat",
coverage = bwfl, linecol = line_colours,
annotation = cov_annot, annotcol = status_colours,
zoom = 10, featsize = 2, title.width = 0.9,
cex.axis = 1, cex.title = 1, rotation.title = 90, fontsize = 12
)
Figure 21: Customised output from plotHFGC, adding annotations for each site as well as tweaking the labels and font sizes
Just like we were able to provide a separate BigWigFileList to indicate separate tracks for our targets, we can use a similar strategy for the genes and features tracks.
GRangesList of gene will draw each element on a separate track. This may be useful for indicating results from a Differential Gene Expression analysis by plotting up-regulated and down-regulated genes on separate tracks in different coloursGRangesList objects to the features track will add each element of the list as a separate tracks. In this way you may wish to add multiple features, such as a chromHMM track along with CTCF or other known binding sites.A final way of visualising results may be to draw a Profile Heatmap, where we show coverage or fold-enrichment over background in the surrounding regions.
Instead of separating our BigWigFileList into separate tracks, let’s just use a single BigWigFileList with coverage for both ER and H3K27ac.
profile_bwfl <- bwfl %>%
lapply(path) %>%
unlist() %>%
setNames(str_replace_all(names(.), "\\.", " ")) %>%
BigWigFileList()
Given we only have a small dataset, let’s just choose the top 20 sites by logFC in the ER dataset, and draw Profile Heatmaps for both targets.
We first make a call to getProfileData() which smooths the data into bins around the site
er_sig <- arrange(er_results, -logFC)[1:20]
pd <- getProfileData(profile_bwfl, er_sig)
We can now pass this to plotProfileHeatmap() and can use any of the categorical columns in the original set of ranges to facet our results.
Here we’ll facet by status in the ER dataset, which will also draw lines in separate colours.
plotProfileHeatmap(pd, "profile_data", facetY = "status") +
scale_colour_manual(values = status_colours) +
scale_fill_gradient(low = "white", high = "red") +
labs(fill = "logCPM")
Figure 22: Profile Heatmap for the 20 most highly ranked sites for increased ER binding
Ranges are centred around the ER binding patterns. The double peak seen earlier in the output from plotHFGC is at the top of the first panel.`
A complete dataset would provide a far cleaner set of results, but we can still see the general increases in ER signal and how these relate to H3K27ac signal.
## R version 4.3.1 (2023-06-16)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.6 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
##
## locale:
## [1] LC_CTYPE=en_AU.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_AU.UTF-8 LC_COLLATE=en_AU.UTF-8
## [5] LC_MONETARY=en_AU.UTF-8 LC_MESSAGES=en_AU.UTF-8
## [7] LC_PAPER=en_AU.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Australia/Adelaide
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] patchwork_1.1.2 rtracklayer_1.60.0
## [3] edgeR_3.42.4 limma_3.56.2
## [5] csaw_1.34.0 Rsamtools_2.16.0
## [7] Biostrings_2.68.1 XVector_0.40.0
## [9] extraChIPs_1.5.9 SummarizedExperiment_1.30.2
## [11] Biobase_2.60.0 MatrixGenerics_1.12.2
## [13] matrixStats_1.0.0 ggside_0.2.2
## [15] BiocParallel_1.34.2 plyranges_1.20.0
## [17] GenomicRanges_1.52.0 GenomeInfoDb_1.36.1
## [19] IRanges_2.34.1 S4Vectors_0.38.1
## [21] BiocGenerics_0.46.0 glue_1.6.2
## [23] lubridate_1.9.2 forcats_1.0.0
## [25] stringr_1.5.0 dplyr_1.1.2
## [27] purrr_1.0.1 readr_2.1.4
## [29] tidyr_1.3.0 tibble_3.2.1
## [31] ggplot2_3.4.2 tidyverse_2.0.0
## [33] BiocStyle_2.28.0
##
## loaded via a namespace (and not attached):
## [1] splines_4.3.1 BiocIO_1.10.0
## [3] bitops_1.0-7 filelock_1.0.2
## [5] polyclip_1.10-4 XML_3.99-0.14
## [7] rpart_4.1.19 lifecycle_1.0.3
## [9] doParallel_1.0.17 lattice_0.21-8
## [11] ensembldb_2.24.0 MASS_7.3-60
## [13] backports_1.4.1 magrittr_2.0.3
## [15] Hmisc_5.1-0 sass_0.4.7
## [17] rmarkdown_2.23 jquerylib_0.1.4
## [19] yaml_2.3.7 metapod_1.8.0
## [21] Gviz_1.44.0 DBI_1.1.3
## [23] RColorBrewer_1.1-3 abind_1.4-5
## [25] zlibbioc_1.46.0 AnnotationFilter_1.24.0
## [27] biovizBase_1.48.0 RCurl_1.98-1.12
## [29] nnet_7.3-19 tweenr_2.0.2
## [31] VariantAnnotation_1.46.0 rappdirs_0.3.3
## [33] circlize_0.4.15 GenomeInfoDbData_1.2.10
## [35] ggrepel_0.9.3 codetools_0.2-19
## [37] DelayedArray_0.26.6 xml2_1.3.5
## [39] ggforce_0.4.1 tidyselect_1.2.0
## [41] shape_1.4.6 futile.logger_1.4.3
## [43] farver_2.1.1 ComplexUpset_1.3.3
## [45] BiocFileCache_2.8.0 base64enc_0.1-3
## [47] GenomicAlignments_1.36.0 jsonlite_1.8.7
## [49] GetoptLong_1.0.5 Formula_1.2-5
## [51] iterators_1.0.14 foreach_1.5.2
## [53] tools_4.3.1 progress_1.2.2
## [55] Rcpp_1.0.11 gridExtra_2.3
## [57] mgcv_1.8-42 xfun_0.39
## [59] withr_2.5.0 formatR_1.14
## [61] BiocManager_1.30.21.1 fastmap_1.1.1
## [63] latticeExtra_0.6-30 fansi_1.0.4
## [65] digest_0.6.33 timechange_0.2.0
## [67] R6_2.5.1 colorspace_2.1-0
## [69] jpeg_0.1-10 dichromat_2.0-0.1
## [71] biomaRt_2.56.1 RSQLite_2.3.1
## [73] utf8_1.2.3 generics_0.1.3
## [75] data.table_1.14.8 prettyunits_1.1.1
## [77] InteractionSet_1.28.1 httr_1.4.6
## [79] htmlwidgets_1.6.2 S4Arrays_1.0.5
## [81] pkgconfig_2.0.3 gtable_0.3.3
## [83] blob_1.2.4 ComplexHeatmap_2.16.0
## [85] htmltools_0.5.5 bookdown_0.34
## [87] ProtGenerics_1.32.0 clue_0.3-64
## [89] scales_1.2.1 png_0.1-8
## [91] knitr_1.43 lambda.r_1.2.4
## [93] rstudioapi_0.15.0 tzdb_0.4.0
## [95] rjson_0.2.21 nlme_3.1-162
## [97] checkmate_2.2.0 curl_5.0.1
## [99] cachem_1.0.8 GlobalOptions_0.1.2
## [101] parallel_4.3.1 foreign_0.8-84
## [103] AnnotationDbi_1.62.2 restfulr_0.0.15
## [105] pillar_1.9.0 grid_4.3.1
## [107] vctrs_0.6.3 dbplyr_2.3.3
## [109] cluster_2.1.4 htmlTable_2.4.1
## [111] evaluate_0.21 VennDiagram_1.7.3
## [113] EnrichedHeatmap_1.30.0 GenomicFeatures_1.52.1
## [115] cli_3.6.1 locfit_1.5-9.8
## [117] compiler_4.3.1 futile.options_1.0.1
## [119] rlang_1.1.1 crayon_1.5.2
## [121] labeling_0.4.2 interp_1.1-4
## [123] stringi_1.7.12 deldir_1.0-9
## [125] munsell_0.5.0 lazyeval_0.2.2
## [127] Matrix_1.6-0 BSgenome_1.68.0
## [129] hms_1.1.3 bit64_4.0.5
## [131] KEGGREST_1.40.0 highr_0.10
## [133] igraph_1.5.0.1 broom_1.0.5
## [135] memoise_2.0.1 bslib_0.5.0
## [137] bit_4.0.5 GenomicInteractions_1.34.0